NSF PAR Search | NSF Public Access Repository

Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Draesslerová, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Benjamin; Manzini, Giovanni; Navarro, Gonzalo (July 2024, SEA 2024)

Full Text Available
A simple grammar-based index for finding approximately longest common substrings

Gagie, Travis; Kashgouli, Sana; Navarro, Gonzalo (September 2023, SPIRE 2023)

Full Text Available
Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

https://doi.org/10.4230/LIPIcs.SEA.2024.10

Draesslerová, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Liberti, Leo (Ed.)
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
- build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.
With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
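The four indexing and query steps listed in the abstract above can be sketched in Python for a single, already-computed match. This is only an illustration under stated assumptions, not the authors' implementation: a plain sorted suffix array stands in for the augmented FM-index, the minimum and maximum are found by scanning the interval rather than by range queries, MEM computation itself is omitted, and the toy genomes, the tree, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(genomes):
    """Concatenate genomes in left-to-right leaf order and build a naive suffix array."""
    text = "".join(g + "$" for g in genomes)  # "$" keeps matches inside one genome
    owner = []  # owner[i] = index of the genome that contains text position i
    for gid, g in enumerate(genomes):
        owner.extend([gid] * (len(g) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # quadratic, toy sizes only
    return text, sa, owner


def sa_interval(text, sa, match):
    """Half-open suffix array range [lo, hi) of suffixes that start with `match`."""
    m = len(match)
    keys = [text[i:i + m] for i in sa]  # truncated suffixes stay in sorted order
    return bisect_left(keys, match), bisect_right(keys, match)


def lca(parent, u, v):
    """Lowest common ancestor using parent pointers (the root has no entry)."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = parent.get(u)
    while v not in ancestors:
        v = parent[v]
    return v


def classify(index, parent, leaves, match):
    """Return the LCA of all genomes containing `match`, or None if it never occurs."""
    text, sa, owner = index
    lo, hi = sa_interval(text, sa, match)
    if lo == hi:
        return None
    positions = sa[lo:hi]
    # Leaves are in left-to-right order, so the genomes holding the smallest and
    # largest positions bracket every other genome that contains the match.
    first, last = owner[min(positions)], owner[max(positions)]
    return lca(parent, leaves[first], leaves[last])


# Hypothetical toy input: three genomes under the tree root -> (A -> (g0, g1), g2).
genomes = ["ACGTAC", "ACGGTT", "TTGACA"]
leaves = ["g0", "g1", "g2"]
parent = {"g0": "A", "g1": "A", "A": "root", "g2": "root"}
index = build_index(genomes)

for match in ("ACG", "GAC", "AC", "GGG"):
    print(match, classify(index, parent, leaves, match))
```

Taking only the minimum and maximum positions is enough because the genomes are concatenated in the tree's left-to-right leaf order: the LCA of the leftmost and rightmost genomes containing a match is also an ancestor of every genome in between.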
Full Text Available
A Fast and Small Subsampled R-Index

https://doi.org/10.4230/LIPIcs.CPM.2021.13

Cobas, Dustin; Gagie, Travis; Navarro, Gonzalo (January 2021, Leibniz international proceedings in informatics)
Full Text Available
PHONI: Streamed Matching Statistics with Multi-Genome References

https://doi.org/10.1109/DCC50243.2021.00027

Boucher, Christina; Gagie, Travis; I, Tomohiro; Köppl, Dominik; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo; Pacheco, Alejandro; Rossi, Massimiliano (March 2021, Data Compression Conference)

Full Text Available
Practical Random Access to SLP-Compressed Texts

https://doi.org/10.1007/978-3-030-59212-7_16

Gagie, Travis; I, Tomohiro; Manzini, Giovanni; Navarro, Gonzalo; Sakamoto, Hiroshi; Seelbach Benkner, Louisa; Takabatake, Yoshimasa (January 2020, SPIRE 2020)
Grammar-based compression is a popular and powerful approach to compressing repetitive texts, but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them, and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.
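To make the random-access primitive concrete, here is a minimal Python sketch of the textbook approach: store the expansion length of every nonterminal of a straight-line program (SLP) and descend from the start symbol, going left or right by comparing the query position against the left child's length. It is not the encoding proposed in the paper; the grammar representation (a dict mapping each symbol to either one character or a pair of symbols), the example grammar, and the function names are assumptions made for illustration.

```python
def expansion_lengths(rules, start):
    """Compute the expansion length of every symbol reachable from `start`."""
    length = {}
    stack = [start]  # explicit stack avoids recursion limits on deep grammars
    while stack:
        sym = stack[-1]
        if sym in length:
            stack.pop()
            continue
        rhs = rules[sym]
        if isinstance(rhs, str):  # terminal rule: a single character
            length[sym] = 1
            stack.pop()
            continue
        left, right = rhs
        pending = [c for c in (left, right) if c not in length]
        if pending:
            stack.extend(pending)  # resolve the children first
        else:
            length[sym] = length[left] + length[right]
            stack.pop()
    return length


def access(rules, length, start, i):
    """Return character i of the text generated by `start`, without decompressing."""
    if not 0 <= i < length[start]:
        raise IndexError(i)
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < length[left]:
            sym = left  # position falls inside the left child's expansion
        else:
            i -= length[left]  # skip the left child's expansion
            sym = right


# Hypothetical example grammar generating the string "abaababa".
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # ab
    "X2": ("X1", "A"),   # aba
    "X3": ("X2", "X1"),  # abaab
    "X4": ("X3", "X2"),  # abaababa
}
lengths = expansion_lengths(rules, "X4")
print("".join(access(rules, lengths, "X4", i) for i in range(lengths["X4"])))
```

Each query in this sketch takes time proportional to the height of the grammar's parse tree, which is why practical encodings pay attention to how the rules and lengths are laid out in memory.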
Full Text Available
